TABLE OF CONTENTS
  1. Introduction
    1.1 Industry Background
    1.2 Fleetmetrica
    1.3 Previous Study
    1.4 Research Issue
  2. Data
    2.1 Data Overview
    2.2 Data Cleaning
    2.3 Exploratory Feature Engineering: Case Control Study
  3. Modelling
    3.1 Reproducing Original Model
    3.2 Modelling
          3.2.1 Fitting on Company A
          3.2.2 The Main Model with Bootstrapping and Cross Validation
          3.2.3 Residual Check
          3.2.4 Testing on Company B
    3.3 Model Comparison
  4. Result and Conclusion
    4.1 Decision Criteria
    4.2 Limitations
    4.3 Conclusion


1. Introduction

This study proposes a five-star rating system developed from a Cox proportional-hazards (Cox-PH) model to classify truck driver risk and help companies in the industry make informed decisions. More than three million rows of driver data are used to train the survival model. Building on a model previously developed by the company, this study attempts to improve prediction precision and to generalize the results to different fleets using more significant features. The following parts of this report discuss the background, data, and model in greater detail.

1.1 Industry Background

There are 700,000 trucks across Canada, 420,000 of which carry freight commercially. As a result, large amounts of data are produced and stored every day. To make the most of this data, fleet companies now employ many advanced techniques to boost their operations. Fleet managers are tasked with a wide range of responsibilities such as GPS-based vehicle tracking, mechanical diagnostics of log events, and most recently, driver behavior analysis using machine learning and even deep learning techniques. With the growing maturity of big data algorithms and applications, the operational management efficiency of the transportation industry will improve significantly.

1.2 Fleetmetrica

Fleetmetrica is a data analytics company in the transportation industry helping fleets leverage their telematics data for bottom-line results. Its solutions simplify the process of managing critical and time-consuming activities relating to key business issues such as safety, compliance, fuel, utilization, customer service, maintenance, and driver satisfaction and retention.

Fleetmetrica’s platform uses predictive analytics to automatically determine when and where to act in the fleet, then makes this information available to the people who need it, in the format they want to receive it.

1.3 Previous Study

Previously, Fleetmetrica developed a survival model to evaluate each driver’s accident risk. The initial key findings, resulting from monitoring driving performance with a telematics system and providing drivers with feedback, coaching, and incentives for improved performance, are as follows:

  • The resulting model shows limited evidence of elevated risk for overspeeding, and only when driving on a highway. In addition, the risk of an accident drops when the driver spends a higher proportion of time overspeeding, which is counterintuitive.

  • The risk of accidents is elevated over the baseline by an amount equal to:

Driver Risk Score = \(e^{35.57p-346.6p^2}\)


where p is the average of the proportion of highway speeding pings to total pings. The quantity peaks around p = 0.05 and decreases as p gets larger.
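As a sanity check on the stated peak, the exponent is a downward parabola, so the maximum can be computed directly (a quick Python sketch, not part of the original analysis):

```python
import math

def driver_risk(p):
    # Relative risk from the original model: exp(35.57*p - 346.6*p^2),
    # where p is the proportion of highway speeding pings to total pings.
    return math.exp(35.57 * p - 346.6 * p ** 2)

# The exponent is a downward parabola, so risk peaks where its derivative
# is zero: p = 35.57 / (2 * 346.6) ~= 0.051, consistent with the ~5% peak.
p_peak = 35.57 / (2 * 346.6)
```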

1.4 Research Issue

Since Fleetmetrica’s initial model has limited explanatory power and is trained and tested on only one company’s (A) data, this research will explore:

  • whether there are more predictive features for a new model through feature engineering
  • whether the new model can be generalized to other fleet companies (B and C)

2. Data

Data from three companies (A, B, and C) are used in this study. We use company A’s data for model training and validation; we also test on company B’s and company C’s data to check the model’s applicability.

2.1 Data Overview

  • Company A’s data has 5 files in csv format:

    • A_trips(\(\underline{driver\_id}\), distance, fuel_used, long_idle, engine_time, overspeed_time, drive_time, high_rpm_time, start, end) , 4304949 obs.
    • A_accidents(\(\underline{driver\_id}\), date), 1381 obs.
    • A_overspeeds(\(\underline{driver\_id}\), date, distance, speeding_pings, total_pings, speeding_pings_other, total_pings_other), 793345 obs.
    • A_speed_incidents(\(\underline{driver\_id}\), datetime, speed_limit, speed, latitude, longitude, place_description), 2861617 obs.
    • A_hard_brakes(\(\underline{driver\_id}\), datetime), 51209 obs.
  • Company B’s data has 4 files in csv format:

    • B_trips(\(\underline{driver\_id}\), distance, fuel_used, long_idle, engine_time, overspeed_time, drive_time, high_rpm_time, start,end, start_odom, end_odom, long_idle_count, long_idle_fuel, short_idle_time, short_idle_count, short_idle_fuel, cruise_events, cruise_time, cruise_fuel, cruise_distance) , 2157931 obs.
    • B_accidents(\(\underline{driver\_id}\),date), 1171 obs.
    • B_overspeeds(\(\underline{driver\_id}\), date, distance, speeding_pings, total_pings, speeding_pings_other, total_pings_other, speeding_pings_highway, total_pings_highway), 638222 obs.
    • B_speed_incidents(\(\underline{driver\_id}\), datetime, speed_limit, speed,GPS), 256854 obs.
  • Company C’s data has 3 files in csv format:

    • C_trips(\(\underline{driver\_id}\), distance, fuel_used, engine_time, overspeed_time, drive_time, high_rpm_time, start,end, inter_trip_idle_time, low_rpm_times, total_rpm_times, equipment_id), 2181978 obs.
    • C_accidents(\(\underline{driver\_id}\), date), 46 obs.
    • C_hard_brakes(\(\underline{driver\_id}\), datetime), 21976 obs.

To generalize the model, we need features common to all three companies; to improve accuracy, however, we should also explore company-specific features. Therefore, in the data-cleaning part, we retain both the common features of the three companies and company A-specific features. Since all files share the primary key – driver_id – we join the other datasets (overspeeds, speed_incidents, hard_brakes) to trips and run models on these joined datasets. Data are not consistently available across the three companies: features like hard_brakes do not exist for company B, and overspeeding features are not available for company C. It is also worth mentioning that accidents carry only dates without exact times, while the corresponding trips data is time-stamped; this can lead to inaccurate joins in some conditions. We tried treating the last trip on an accident date as the accident trip when multiple trips occurred that day, but ultimately dropped this logic because it changed the distribution of the whole dataset (the fitted model curve changed completely), which means it is a bad approximation. We instead use the original data-joining method, elaborated later. The following plots show the geographical distribution of accidents on a map:

Accidents Distribution, company A
Accidents Distribution, company B

2.2 Data Cleaning

  • Data connection: to create data that can be fed into a statistical model, we need to join all datasets with the trips dataset of each company. Data joining is based on three conditions:
    • driver_id match in different sets;
    • accident date occurs later than trip start-datetime;
    • accident date occurs before trip end-datetime.

We did not adopt the method of approximating the final trip on an accident date as the accident trip when multiple trips occur, because it incorrectly classifies the last leg of a multi-day trip as non-accident and substantially changes which trips are labeled as accidents, even though it would increase the matched accident count to more than 200.

  • For company A, we connect trips data with the accidents, overspeeding, hard-brake, and incident data. Notably, the number of accidents drops sharply from 1,381 to 128, mainly because of missing corresponding trips data. The gaps come from two sources: 1) accidents that predate the trips data; 2) trips missing for many time periods. With these characteristics in mind, we model this dataset with 128 accidents and 3,007,820 rows.

  • For company B, trips are joined with accidents, overspeeding and incidents. No hard brakes data is available. Accidents are also sparse compared with total number of rows in trips, just like what happened to company A. After cleaning, Company B has 97 accidents and 1559401 rows.

  • For company C, only 12 events remain after joining (46 accident events before joining); more importantly, critical variables such as long_idle do not exist, and overspeeding and incident data are not provided either. Therefore, we will not test the model on company C because of insufficient data, and no further discussion will center on company C. The following figures show how the trip datetimes of companies A and B overlap with accident dates.

Density Histogram of Company A
  • Variable description: company A and company B have the same variables, which are described as follows:
    • driver_id : numeric, a unique identifier of a driver, used as a primary key to join different datasets;
    • julianstarts: numeric, the datetime in days when a trip starts;
    • julianends: numeric, the datetime in days when a trip ends;
    • drive_time: numeric, the time in seconds when trucks are moving;
    • engine_time: numeric, the time in seconds when trucks engines are started;
    • high_rpm_time: numeric, the engine time that the company deemed as high revolutions per minute;
    • distance: numeric, the moving distance in kilometers;
    • long_idle: numeric, the time when the truck is not moving with its engine on;
    • overspeed_time: numeric, the time when drivers are overspeeding;
    • fuel_used: numeric, total fuel used during a trip;
    • eventPrev: numeric 0-1; 1 means an accident happened during the trip and 0 means no accident happened during the trip;
    • hiprops: numeric, a ratio between 0 and 1, representing the ratio of a driver’s overspeeding pings to total pings on highways;
    • townprops: numeric, a ratio between 0 and 1, representing the ratio of a driver’s overspeeding pings to total pings on town roads;
    • o5props: numeric, a ratio between 0 and 1, representing the ratio of a driver’s overspeeding pings (exceeding the speed limit by more than 5 km/h) to total pings;
    • o10props: same as o5props, with the excess over the speed limit greater than 10 km/h;
    • o25props: same as o5props, with the excess over the speed limit greater than 25 km/h;
    • h5props: numeric, a ratio between 0 and 1, representing the ratio of a driver’s highway overspeeding pings (exceeding the speed limit by more than 5 km/h) to total pings;
    • h10props: same as h5props, with the excess greater than 10 km/h;
    • h25props: same as h5props, with the excess greater than 25 km/h;
    • t5props: numeric, a ratio between 0 and 1, representing the ratio of a driver’s town-road overspeeding pings (exceeding the speed limit by more than 5 km/h) to total pings;
    • t10props: same as t5props, with the excess greater than 10 km/h;
    • t25props: same as t5props, with the excess greater than 25 km/h;
  • Data filtering: some variables have negative values that do not make sense, such as negative distance or time, so we filter on the following fields:
    • distance >= 0
    • fuel_used >= 0
    • engine_time >= 0
    • overspeed_time >= 0
    • drive_time >= 0
    • julianstarts >= 0
    • julianends >= 0
  • Clean data: the table below compares the data before and after cleaning:

              Raw                                        Cleaned
              # features   trip obs.    accident obs.    # features   trip obs.    accident obs.
  company A   20           4,304,949    1,381            21           3,007,820    128
  company B   33           2,157,931    1,171            21           1,559,401    97
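The join-and-filter logic described above can be sketched as follows (toy records using the field names from the variable list; a simplification for illustration, not Fleetmetrica’s actual pipeline):

```python
# Toy records mimicking the schema above: trips carry julian start/end days,
# while accidents carry only a date (in days), hence the date-window join.
trips = [
    {"driver_id": 1, "julianstarts": 10.0, "julianends": 12.0, "distance": 300.0},
    {"driver_id": 1, "julianstarts": 20.0, "julianends": 22.0, "distance": 250.0},
    {"driver_id": 2, "julianstarts": 15.0, "julianends": 16.0, "distance": -5.0},
]
accidents = [{"driver_id": 1, "date": 21.0}]

def join_and_clean(trips, accidents):
    # Index accident dates by driver_id (the shared primary key).
    by_driver = {}
    for a in accidents:
        by_driver.setdefault(a["driver_id"], []).append(a["date"])
    cleaned = []
    for t in trips:
        # Data filtering: drop physically impossible values.
        if t["distance"] < 0 or t["julianstarts"] < 0 or t["julianends"] < 0:
            continue
        # eventPrev = 1 iff an accident date falls inside the trip window.
        dates = by_driver.get(t["driver_id"], [])
        hit = any(t["julianstarts"] <= d <= t["julianends"] for d in dates)
        cleaned.append(dict(t, eventPrev=int(hit)))
    return cleaned

rows = join_and_clean(trips, accidents)
```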

2.3 Exploratory Feature Engineering: Case Control Study

  1. Plot accidents frequency by month:

To understand the response variable and test the integrity of the data, it is helpful to check the average occurrence of accidents for companies A and B by plotting the frequency per million trips. Company A’s accident data looks reasonable, while company B’s data is incomplete: company B has no accidents in months 2 through 6. The monthly accident frequency is shown below:

  2. Correlation matrix of variables

Using the monthly-average method, we calculate monthly means of variables and run correlation analyses on those means to see whether certain variables could be correlated with accidents. If a variable is highly correlated with the response variable, its monthly sample means should also be closely correlated with the monthly sample means of the response variable. Put another way, if the sample means of a variable are not highly correlated with those of the response variable, further exploration of the relationship could be fruitless. The correlations of the monthly means of selected variables are as follows:

Features Correlation Matrix of Company A

This initial plot shows the correlation of variables’ means of each month from January to December. AccidProb is the monthly accident frequency per million trips; all variables in the matrix have monthly means with 12 data points. Ratio variables are calculated as follows:

  • idleDriveRatio = long_idle / drive_time ;
  • hRpmDriveRatio = high_rpm_time / drive_time ;
  • DriveEngineRatio = drive_time / engine_time ;
  • osDriveRatio = overspeed_time / drive_time ;
  • fuelDistRatio = fuel_used / distance

These ratios could be potentially useful predictors for modeling, as some of them show high correlation with accident frequency.
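As a sketch, the ratio features above can be derived from a single trip record like this (field names follow the variable list in section 2.2; the sample values are made up):

```python
def ratio_features(trip):
    # Derive the five ratio predictors defined above from one trip record.
    return {
        "idleDriveRatio": trip["long_idle"] / trip["drive_time"],
        "hRpmDriveRatio": trip["high_rpm_time"] / trip["drive_time"],
        "DriveEngineRatio": trip["drive_time"] / trip["engine_time"],
        "osDriveRatio": trip["overspeed_time"] / trip["drive_time"],
        "fuelDistRatio": trip["fuel_used"] / trip["distance"],
    }

# Made-up one-hour trip: times in seconds, distance in km, fuel in liters.
trip = {"long_idle": 600, "drive_time": 3600, "high_rpm_time": 360,
        "engine_time": 4500, "overspeed_time": 180, "fuel_used": 90,
        "distance": 300}
feats = ratio_features(trip)
```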

  3. Case control study: odds ratio analysis

To further show that features have an influence on accidents, we carry out a case-control study using odds ratio analysis on different variables. An odds ratio (OR) is a statistic that quantifies the strength of the association between two events, A and B. The odds ratio is defined as the ratio of the odds of A in the presence of B and the odds of A in the absence of B, or equivalently (due to symmetry), the ratio of the odds of B in the presence of A and the odds of B in the absence of A. Two events are independent if and only if the OR equals 1, i.e., the odds of one event are the same in either the presence or absence of the other event. If the OR is greater than 1, then A and B are associated (correlated) in the sense that, compared to the absence of B, the presence of B raises the odds of A, and symmetrically the presence of A raises the odds of B. Conversely, if the OR is less than 1, then A and B are negatively correlated, and the presence of one event reduces the odds of the other event. The following shows selected variables’ influence on accidents; the respective ORs are calculated and plotted.

Odds ratios can be used to improve model training accuracy by limiting variable values to certain upper bounds. Since we use the whole raw dataset without over-cleaning, here we simply show representative examples from the case-control study, namely idleDriveRatio (idle_ratio in the plot) and hRpmDriveRatio (rpm_ratio in the plot). Through re-sampling we run multiple case-control analyses of variables and plot the representative odds ratios as follows:

Odds Ratios

When idleDriveRatio is larger than 0.5, the odds ratio exceeds 2.5, which means this variable consistently contributes to accidents; this is also confirmed by the correlation matrix of monthly means. When hRpmDriveRatio is larger than 0.15, the odds ratio falls toward 0, suggesting that higher RPM ratios are associated with lower accident rates.
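For reference, the OR for a 2x2 case-control table can be computed directly; the counts below are hypothetical, chosen only to illustrate an OR above 2.5:

```python
def odds_ratio(a, b, c, d):
    # 2x2 case-control table:
    #   a = exposed cases,     b = exposed non-cases
    #   c = non-exposed cases, d = non-exposed non-cases
    # OR = (a/b) / (c/d); OR > 1 means exposure raises the odds of an accident.
    return (a / b) / (c / d)

# Hypothetical counts: trips with idleDriveRatio > 0.5 (exposed) vs. the rest.
or_idle = odds_ratio(30, 10_000, 98, 90_000)
```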


3. Modelling

The Cox proportional-hazards model (Cox, 1972) is essentially a regression model, commonly used in medical research, for investigating the association between patients’ survival time and one or more predictor variables. The Cox model is expressed through the hazard function, denoted h(t). Briefly, the hazard function can be interpreted as the risk of dying at time t. It can be estimated as follows:

\(h(t) = h_0(t)\,e^{b_1x_1 + b_2x_2 + \dots + b_px_p}\)


where t represents the survival time, h(t) is the hazard function, \(h_0(t)\) is the baseline hazard, and \(x_1, \dots, x_p\) are predictor variables with coefficients \(b_1, \dots, b_p\).

In this study, survival time is the time during which a driver does not have an accident.
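A minimal sketch of the covariate part of the Cox model: because the baseline hazard \(h_0(t)\) cancels, the hazard ratio between two drivers depends only on their covariates (the coefficients below are illustrative, not the fitted model):

```python
import math

def relative_hazard(betas, xs):
    # The covariate factor of the Cox model: h(t)/h0(t) = exp(b1*x1 + ... + bp*xp).
    return math.exp(sum(b * x for b, x in zip(betas, xs)))

# Hazard ratio between two drivers who differ only in their second covariate
# (0.3 vs. 0.1); the shared baseline hazard and shared covariate cancel out.
hr = relative_hazard([0.4, 1.6], [5.0, 0.3]) / relative_hazard([0.4, 1.6], [5.0, 0.1])
```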

3.1 Reproducing Original Model

  • To reproduce the original model, we fit company A’s data using the single feature hiprops. The resulting confusion matrix is:
##        
##               0       1
##   FALSE 2077028      67
##   TRUE   930664      61
  • The model as a whole is significant: the p-values (likelihood-ratio test, Wald test, and log-rank test) are all less than 0.05, with a concordance of 0.558. The precision at the 70th-percentile decision boundary is not ideal at 0.48 (of the 128 events, only 61 accidents are predicted by the original model). Even though the coefficients differ slightly from the original ones, the shapes are similar:

  • The original model proposed that a driver’s risk peaks when his highway speeding-ping proportion is around 5.2%; in the reproduced model the peak is around 6.3%. We conclude that the original model can be reproduced in this study.

3.2 Modelling

3.2.1 Fitting on Company A

  • During model selection, we use the R library “mfp” to select polynomial terms of variables. A model with a log term and an interaction term is ultimately selected. The p-values and coefficients are satisfactory: all predictors are significant at the 0.001 level, and the whole model is significant at the 3e-06 level with a concordance of 0.623.

  • Let’s test on the training set (we will discuss over-fitting shortly) and check the prediction precision:

##        
##               0       1
##   FALSE 2075353      43
##   TRUE   932339      85

The precision is 85/(85+43) = 0.66, with the 70th percentile of fitted values as the decision boundary. In other words, the model predicts 85 of the 128 accidents. As the percentile threshold increases, the precision decreases. We will also look at the model’s F1 score.

3.2.2 The Main Model with Bootstrapping and Cross Validation

  • Reducing over-fitting with bootstrapping and cross-validation. We cannot use the current model because it is over-fitted, which can lead to greater variance in out-of-sample testing, so we use a simple cross-validation split of company A’s data into a training set (80%) and a development set (20%). We then re-run the model fitting procedure 20 times and take the coefficients with the best accuracy (highest F1 score) as the final model output. The out-of-sample precision and F1 score are calculated 20 times using random splitting with replacement (bootstrapping). Precision is calculated as true positives/(true positives + false positives); recall as true positives/(true positives + false negatives); F1 score as 2*Precision*Recall/(Precision + Recall). Accuracy measures are plotted as follows:
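The three accuracy measures defined above can be computed from confusion-matrix counts as follows (the counts in the example are hypothetical, for illustration only):

```python
def precision_recall_f1(tp, fp, fn):
    # Precision = TP/(TP+FP), Recall = TP/(TP+FN),
    # F1 = harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```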

We take the fit that has the largest testing F1 score as the final model, which is given as follows:

## Call:
## coxph(formula = Surv(I(julianstarts - 0.1), julianends, eventPrev) ~ 
##     log(distance + 1) + I(idleDriveRatio + 0.1) + I((idleDriveRatio + 
##         0.1) * log((idleDriveRatio + 0.1))), data = cvi[[1]])
## 
##   n= 2393416, number of events= 93 
## 
##                                                            coef exp(coef)
## log(distance + 1)                                        0.4085    1.5045
## I(idleDriveRatio + 0.1)                                  1.6007    4.9566
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1))) -0.6258    0.5349
##                                                         se(coef)      z
## log(distance + 1)                                         0.0978  4.177
## I(idleDriveRatio + 0.1)                                   0.3618  4.424
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1)))   0.1695 -3.691
##                                                         Pr(>|z|)    
## log(distance + 1)                                       2.96e-05 ***
## I(idleDriveRatio + 0.1)                                 9.67e-06 ***
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1))) 0.000223 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                                                         exp(coef) exp(-coef)
## log(distance + 1)                                          1.5045     0.6647
## I(idleDriveRatio + 0.1)                                    4.9566     0.2018
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1)))    0.5349     1.8697
##                                                         lower .95 upper .95
## log(distance + 1)                                          1.2421    1.8224
## I(idleDriveRatio + 0.1)                                    2.4390   10.0727
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1)))    0.3836    0.7457
## 
## Concordance= 0.611  (se = 0.03 )
## Likelihood ratio test= 22.39  on 3 df,   p=5e-05
## Wald test            = 23.91  on 3 df,   p=3e-05
## Score (logrank) test = 20.05  on 3 df,   p=2e-04

The training set has 2,393,416 samples with 93 accidents (y=1); the out-of-sample test set has 614,404 samples with 35 accidents (y=1). The F1 score is close to 0 because there are many false-positive predictions: with such a low density of accidents in the dataset, predicting accidents accurately is very challenging. The test result on company A is shown below:

##        
##              0      1
##   FALSE 423929     10
##   TRUE  190440     25

3.2.3 Residual Check

Schoenfeld residual diagnostics. To test whether the proportional-hazards assumption holds (independence between residuals and time), the function cox.zph() correlates, for each predictor, the corresponding set of scaled Schoenfeld residuals with time. A higher p-value is better; p > 0.05 means there is no strong evidence of a relationship between residuals and time. This model shows no evidence of dependence between time and residuals. We also saw an improvement in the global p-value from bootstrapping.

##                                                         chisq df    p
## log(distance + 1)                                       0.658  1 0.42
## I(idleDriveRatio + 0.1)                                 1.041  1 0.31
## I((idleDriveRatio + 0.1) * log((idleDriveRatio + 0.1))) 0.761  1 0.38
## GLOBAL                                                  2.536  3 0.47
  • Residual plot. We can visualize the correlation of predictors and residuals. The following plot shows a solid flat line around 0, indicating no correlation between residuals and predictors.

3.2.4 Testing on Company B

Using the final fitted model, we have the following testing result on Company B’s data.

##        
##               0       1
##   FALSE 1080955      22
##   TRUE   478349      75

In comparison, the testing result on B using original model has the following output:

result(fit0,b_data,70)
##        
##               0       1
##   FALSE 1078852      68
##   TRUE   480452      29

3.3 Model Comparison

Model comparison summary is as follows:

Model      Concordance   Significance (logrank)   Predictor Significance   Company B Testing Precision   Company A Testing Precision
Original   0.558         p = 0.05                 p = 0.01                 29/97 = 0.30                  0.51
New        0.623         p = 3e-06                p = 2e-05                75/97 = 0.77                  0.71
  • The model developed in this study improves on the original model in two aspects:
  1. More robust: model concordance and predictor significance are improved dramatically, which means the new model is statistically better;

  2. More generalizable: the new model’s out-of-sample testing precision is better than the original model’s, which means the new model is better in practice. The following shows the trend of accident and non-accident prediction precision for the two models.


4. Result and Conclusion

In this part, we discuss model outputs that can be used for managerial decisions. Specifically, a five-interval risk evaluation system is proposed to aid decision making. The limitations of this study are also discussed, and the final part concludes.

4.1 Decision Criteria

The following histogram shows the distribution of fitted values in this study. To simplify classifying driver riskiness, the fitted model values are split into five groups, with boundaries marked in red in the plot. When a new driver’s data is fed into the model, the driver’s fitted value is calculated; we then check which interval the value falls into, which yields the driver’s risk score. Based on company A’s data, we have the following decision boundaries: (-∞,-0.376], (-0.376,-0.025], (-0.025,0.204], (0.204,0.428], (0.428,+∞).
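A minimal sketch of mapping a fitted value to one of the five intervals (and the corresponding star rating, where more stars mean lower risk), using the boundaries above:

```python
import bisect

# Interval boundaries from the company A fitted-value distribution above.
BOUNDARIES = [-0.376, -0.025, 0.204, 0.428]

def star_rating(fitted_value):
    # Intervals are left-open/right-closed, so bisect_left keeps a value
    # equal to a boundary in the lower-risk group; fewer stars = higher risk.
    return 5 - bisect.bisect_left(BOUNDARIES, fitted_value)
```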

4.2 Limitations

From the plot above we know that, using certain thresholds, we classify trips as accident or non-accident based on fitted values. However, the distinction is not clear-cut: the two distributions overlap substantially in the middle, so we end up with a model of fair precision at the price of a low F1 score. Improving this under the current data quality is quite challenging. Another concern is that, through data joining, the original number of accidents drops from over 1,000 to around 100; this mismatch of records degrades data quality. In addition, company B has no accidents available from February to June. Therefore, the way to further improve the model is better-matched trips and accident data.

4.3 Conclusion

We conclude this study with the following model:

Driver Risk Score = \(e^{0.41\log(distance+1) + 1.6(idleDriveRatio+0.1) - 0.63(idleDriveRatio+0.1)\log(idleDriveRatio+0.1)}\)


where,

  • distance is the trip distance during certain time periods
  • idleDriveRatio is the ratio of long-idle time to drive time during trips
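Using the rounded coefficients shown above, the score can be evaluated directly (a sketch; the input values below are illustrative):

```python
import math

def driver_risk_score(distance, idle_drive_ratio):
    # Final model's relative risk, with coefficients rounded as in the formula above.
    r = idle_drive_ratio + 0.1
    return math.exp(0.41 * math.log(distance + 1) + 1.6 * r - 0.63 * r * math.log(r))

# Over the observed range, a higher idle-to-drive ratio raises the score.
low = driver_risk_score(300, 0.1)
high = driver_risk_score(300, 0.6)
```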

And the decision boundary is :

Decision Intervals Driver Risks Driver Ratings
(-∞,-0.376] very low ⭐⭐⭐⭐⭐
(-0.376,-0.025] low to medium ⭐⭐⭐⭐
(-0.025,0.204] medium ⭐⭐⭐
(0.204,0.428] medium to high ⭐⭐
(0.428,+∞) very high ⭐